The objective is to build a model that diagnoses diabetes in female patients of Pima Indian heritage who are at least 21 years old. The model predicts whether a patient has diabetes (Outcome = 1) or not (Outcome = 0) from eight diagnostic measurements: pregnancies, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age.
#import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score, roc_auc_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import KFold
import warnings
warnings.simplefilter(action='ignore')
sns.set()
plt.style.use("ggplot")
%matplotlib inline
!pip install missingno
!pip install -U scikit-learn
! pip install streamlit
# Basic exploration
# read the dataset from the working directory
df = pd.read_csv("diabetes.csv")
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Pregnancies               768 non-null    int64
 1   Glucose                   768 non-null    int64
 2   BloodPressure             768 non-null    int64
 3   SkinThickness             768 non-null    int64
 4   Insulin                   768 non-null    int64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64
 8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
# descriptive statistics of the dataset
df.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
# (rows, columns)
df.shape
(768, 9)
# distribution of outcome variable
df.Outcome.value_counts()*100/len(df)
Outcome
0    65.104167
1    34.895833
Name: count, dtype: float64
#histogram to understand the distribution
import warnings
warnings.filterwarnings("ignore")
for i in df.select_dtypes(include="number").columns:
    sns.histplot(data=df, x=i)
    plt.show()
# plot the hist of the age variable
plt.figure(figsize=(8,7))
plt.xlabel('Age', fontsize=10)
plt.ylabel('Count', fontsize=10)
df['Age'].hist(edgecolor="black")
<Axes: xlabel='Age', ylabel='Count'>
df['Age'].max()
81
df['Age'].min()
21
print("MAX AGE: "+str(df['Age'].max()))
print("MIN AGE: "+str(df['Age'].min()))
MAX AGE: 81
MIN AGE: 21
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
# density plots: a 4x2 grid of subplots, one panel per feature
fig,ax = plt.subplots(4,2, figsize=(20,20))
# sns.distplot is deprecated; histplot(kde=True) is the modern equivalent
sns.histplot(df.Pregnancies, bins=20, kde=True, ax=ax[0,0], color="red")
sns.histplot(df.Glucose, bins=20, kde=True, ax=ax[0,1], color="red")
sns.histplot(df.BloodPressure, bins=20, kde=True, ax=ax[1,0], color="red")
sns.histplot(df.SkinThickness, bins=20, kde=True, ax=ax[1,1], color="red")
sns.histplot(df.Insulin, bins=20, kde=True, ax=ax[2,0], color="red")
sns.histplot(df.BMI, bins=20, kde=True, ax=ax[2,1], color="red")
sns.histplot(df.DiabetesPedigreeFunction, bins=20, kde=True, ax=ax[3,0], color="red")
sns.histplot(df.Age, bins=20, kde=True, ax=ax[3,1], color="red")
<Axes: xlabel='Age', ylabel='Density'>
plt.figure(figsize=(20,6))
plt.subplot(1,3,1)
plt.title("Count Plot")
sns.countplot(x='Pregnancies', data=df)
plt.subplot(1,3,2)
plt.title('Distribution Plot')
sns.histplot(df["Pregnancies"], kde=True)
plt.subplot(1,3,3)
plt.title('Box Plot')
sns.boxplot(y=df["Pregnancies"])
plt.show()
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
df.groupby("Outcome").agg({'Pregnancies':'mean'})
| | Pregnancies |
|---|---|
| Outcome | |
| 0 | 3.298000 |
| 1 | 4.865672 |
df.groupby("Outcome").agg({'Pregnancies':'max'})
| | Pregnancies |
|---|---|
| Outcome | |
| 0 | 13 |
| 1 | 17 |
df.groupby("Outcome").agg({'Glucose':'mean'})
| | Glucose |
|---|---|
| Outcome | |
| 0 | 109.980000 |
| 1 | 141.257463 |
df.groupby("Outcome").agg({'Glucose':'max'})
| | Glucose |
|---|---|
| Outcome | |
| 0 | 197 |
| 1 | 199 |
# Repeat the groupby mean/max summaries for the remaining columns:
# 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'
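Rather than one `groupby` call per column, a single `agg` with a list of statistics covers every remaining column at once. A minimal sketch on a toy stand-in frame (the real `df` is loaded from diabetes.csv above; column names match the dataset):

```python
import pandas as pd

# Toy stand-in for the diabetes DataFrame
toy = pd.DataFrame({
    "Outcome":       [0, 0, 1, 1],
    "BloodPressure": [60, 70, 80, 90],
    "BMI":           [22.0, 30.0, 35.0, 40.0],
})

# One call produces mean and max per Outcome class for every numeric column
summary = toy.groupby("Outcome").agg(["mean", "max"])
print(summary)
```

The result has a two-level column index (`(column, statistic)`), so individual cells are addressed with tuples, e.g. `summary.loc[0, ("BMI", "mean")]`.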
import matplotlib.pyplot as plt
import seaborn as sns
# Plot the count of each category in the 'Outcome' column
plt.figure(figsize=(12, 6))
sns.countplot(x='Outcome', data=df)
plt.title('Count of Outcome Categories')
plt.xlabel('Outcome')
plt.ylabel('Count')
f, ax = plt.subplots(figsize=(8, 8))
df['Outcome'].value_counts().plot.pie(explode=[0, 0.1], autopct="%1.1f%%", ax=ax, shadow=True)
ax.set_title('Outcome distribution')
ax.set_ylabel('')
plt.show()
df.corr()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.000000 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 | 0.221898 |
| Glucose | 0.129459 | 1.000000 | 0.152590 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 | 0.466581 |
| BloodPressure | 0.141282 | 0.152590 | 1.000000 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 | 0.065068 |
| SkinThickness | -0.081672 | 0.057328 | 0.207371 | 1.000000 | 0.436783 | 0.392573 | 0.183928 | -0.113970 | 0.074752 |
| Insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.000000 | 0.197859 | 0.185071 | -0.042163 | 0.130548 |
| BMI | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.000000 | 0.140647 | 0.036242 | 0.292695 |
| DiabetesPedigreeFunction | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.000000 | 0.033561 | 0.173844 |
| Age | 0.544341 | 0.263514 | 0.239528 | -0.113970 | -0.042163 | 0.036242 | 0.033561 | 1.000000 | 0.238356 |
| Outcome | 0.221898 | 0.466581 | 0.065068 | 0.074752 | 0.130548 | 0.292695 | 0.173844 | 0.238356 | 1.000000 |
f,ax = plt.subplots(figsize=[20,15])
sns.heatmap(df.corr(), annot=True, fmt = '.2f', ax=ax, cmap='magma')
ax.set_title("Correlation Matrix", fontsize=20)
plt.show()
# EDA Part Completed
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
dtype='object')
# Zeros in these measurements encode missing values, so convert them to NaN
# (Pregnancies == 0 is arguably a valid value but is treated as missing here as well)
df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
    'BMI', 'DiabetesPedigreeFunction', 'Age']] = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
    'BMI', 'DiabetesPedigreeFunction', 'Age']].replace(0, np.nan)
# Data preprocessing Part
df.isnull().sum()
Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | NaN | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
import missingno as msno
msno.bar(df, color="orange")
<Axes: >
# median of each feature, computed separately for each Outcome class
def median_target(var):
    temp = df[df[var].notnull()]
    temp = temp[[var, 'Outcome']].groupby(['Outcome'])[[var]].median().reset_index()
    return temp
columns = df.columns
columns = columns.drop("Outcome")
for i in columns:
    medians = median_target(i)
    df.loc[(df['Outcome'] == 0) & (df[i].isnull()), i] = medians[i][0]
    df.loc[(df['Outcome'] == 1) & (df[i].isnull()), i] = medians[i][1]
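The class-conditional median fill above can also be written compactly with `groupby(...).transform`. A sketch on a toy frame mimicking the diabetes columns (self-contained so it runs without diabetes.csv):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in a feature column
toy = pd.DataFrame({
    "Glucose": [100.0, np.nan, 150.0, np.nan],
    "Outcome": [0, 0, 1, 1],
})

# Fill each NaN with the median of its own Outcome class
for col in toy.columns.drop("Outcome"):
    toy[col] = toy[col].fillna(toy.groupby("Outcome")[col].transform("median"))

print(toy["Glucose"].tolist())  # → [100.0, 100.0, 150.0, 150.0]
```

`transform("median")` broadcasts each class median back to the original row positions, so `fillna` can consume it directly without any `.loc` masking.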
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
df.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
# pair plot
p = sns.pairplot(df, hue="Outcome")
#Data preprocessing
# Outlier Detection
# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
for feature in df:
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    if df[(df[feature] > upper)].any(axis=None):
        print(feature, "yes")
    else:
        print(feature, "no")
Pregnancies yes
Glucose no
BloodPressure yes
SkinThickness yes
Insulin yes
BMI yes
DiabetesPedigreeFunction yes
Age yes
Outcome no
#Boxplot to identify outliers
import warnings
warnings.filterwarnings("ignore")
for i in df.select_dtypes(include="number").columns:
    sns.boxplot(data=df, x=i)
    plt.show()
plt.figure(figsize=(8,7))
sns.boxplot(x= df["Insulin"], color="red")
<Axes: xlabel='Insulin'>
Q1 = df.Insulin.quantile(0.25)
Q3 = df.Insulin.quantile(0.75)
IQR = Q3-Q1
lower = Q1-1.5*IQR
upper = Q3+1.5*IQR
df.loc[df['Insulin']>upper, "Insulin"] = upper
plt.figure(figsize=(8,7))
sns.boxplot(x= df["Insulin"], color="red")
<Axes: xlabel='Insulin'>
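The upper-bound capping applied to Insulin above can be wrapped in a small reusable helper built on `Series.clip`; a sketch (the name `cap_upper_iqr` is a hypothetical convenience, not part of the original notebook):

```python
import pandas as pd

def cap_upper_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values above Q3 + k*IQR at that bound (upper-side winsorization)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return s.clip(upper=upper)

# Example: the extreme value 100 is pulled down to the IQR fence
vals = pd.Series([1, 2, 3, 4, 100])
capped = cap_upper_iqr(vals)
```

In the notebook this would read `df["Insulin"] = cap_upper_iqr(df["Insulin"])`, replacing the five-line Q1/Q3 block with one call per feature.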
# LOF
# local outlier factor
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=10)
lof.fit_predict(df)
array([ 1,  1,  1, ...,  1,  1,  1])  # mostly 1 (inliers), with a few -1 (flagged outliers)
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
plt.figure(figsize=(8,7))
sns.boxplot(x= df["Pregnancies"], color="red")
<Axes: xlabel='Pregnancies'>
df_scores = lof.negative_outlier_factor_
np.sort(df_scores)[0:20]
array([-3.06509976, -2.38250393, -2.15557018, -2.11501347, -2.08356175,
-1.95386655, -1.83559384, -1.74974237, -1.7330214 , -1.71017168,
-1.70215105, -1.68722889, -1.64294601, -1.64180205, -1.61181746,
-1.61067772, -1.60925053, -1.60214364, -1.59998552, -1.58761193])
threshold = np.sort(df_scores)[7]
threshold
-1.7497423670960557
inliers = df_scores > threshold
df = df[inliers]
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
df.shape
(760, 9)
plt.figure(figsize=(8,7))
sns.boxplot(x= df["Pregnancies"], color="red")
<Axes: xlabel='Pregnancies'>
# Feature Engineering
NewBMI = pd.Series(["Underweight","Normal", "Overweight","Obesity 1", "Obesity 2", "Obesity 3"], dtype = "category")
NewBMI
0    Underweight
1         Normal
2     Overweight
3      Obesity 1
4      Obesity 2
5      Obesity 3
dtype: category
Categories (6, object): ['Normal', 'Obesity 1', 'Obesity 2', 'Obesity 3', 'Overweight', 'Underweight']
df['NewBMI'] = NewBMI
df.loc[df["BMI"] < 18.5, "NewBMI"] = NewBMI[0]
df.loc[(df["BMI"] >= 18.5) & (df["BMI"] <= 24.9), "NewBMI"] = NewBMI[1]
df.loc[(df["BMI"] > 24.9) & (df["BMI"] <= 29.9), "NewBMI"] = NewBMI[2]
df.loc[(df["BMI"] > 29.9) & (df["BMI"] <= 34.9), "NewBMI"] = NewBMI[3]
df.loc[(df["BMI"] > 34.9) & (df["BMI"] <= 39.9), "NewBMI"] = NewBMI[4]
df.loc[df["BMI"] > 39.9, "NewBMI"] = NewBMI[5]
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | NewBMI |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 | Obesity 2 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 | Obesity 2 |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 | 1 | Obesity 2 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 | Obesity 2 |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 | Obesity 3 |
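An equivalent way to derive the BMI bands is `pd.cut` with explicit bin edges, which replaces the chained `.loc` assignments with one call. A sketch on toy values (boundary handling at exactly 18.5 differs slightly from the masked version, since `pd.cut` bins are right-inclusive):

```python
import pandas as pd

bmi = pd.Series([17.0, 22.0, 27.0, 33.6, 43.1])
labels = ["Underweight", "Normal", "Overweight", "Obesity 1", "Obesity 2", "Obesity 3"]

# Right-inclusive bins matching the thresholds used above
new_bmi = pd.cut(bmi, bins=[0, 18.5, 24.9, 29.9, 34.9, 39.9, float("inf")],
                 labels=labels)
print(new_bmi.tolist())
# → ['Underweight', 'Normal', 'Overweight', 'Obesity 1', 'Obesity 3']
```

The result is an ordered categorical column, so it can be assigned directly, e.g. `df["NewBMI"] = pd.cut(df["BMI"], ...)`.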
# insulin in [16, 166] is considered normal
def set_insuline(row):
    if row["Insulin"] >= 16 and row["Insulin"] <= 166:
        return "Normal"
    else:
        return "Abnormal"
df = df.assign(NewInsulinScore=df.apply(set_insuline, axis=1))
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | NewBMI | NewInsulinScore |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 | Obesity 2 | Abnormal |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 | Obesity 2 | Normal |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 | 1 | Obesity 2 | Abnormal |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 | Obesity 2 | Normal |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 | Obesity 3 | Abnormal |
# Bin the glucose variable into categorical intervals. Note the 'High' category
# is defined but never assigned: readings above 126 fall into 'Secret'.
NewGlucose = pd.Series(["Low", "Normal", "Overweight", "Secret", "High"], dtype = "category")
df["NewGlucose"] = NewGlucose
df.loc[df["Glucose"] <= 70, "NewGlucose"] = NewGlucose[0]
df.loc[(df["Glucose"] > 70) & (df["Glucose"] <= 99), "NewGlucose"] = NewGlucose[1]
df.loc[(df["Glucose"] > 99) & (df["Glucose"] <= 126), "NewGlucose"] = NewGlucose[2]
df.loc[df["Glucose"] > 126 ,"NewGlucose"] = NewGlucose[3]
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | NewBMI | NewInsulinScore | NewGlucose |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 | Obesity 2 | Abnormal | Secret |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 | Obesity 2 | Normal | Normal |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 | 1 | Obesity 2 | Abnormal | Secret |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 | Obesity 2 | Normal | Normal |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 | Obesity 3 | Abnormal | Secret |
# One hot encoding
df = pd.get_dummies(df, columns = ["NewBMI", "NewInsulinScore", "NewGlucose"], drop_first=True)
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | NewBMI_Obesity 1 | NewBMI_Obesity 2 | NewBMI_Obesity 3 | NewBMI_Overweight | NewBMI_Underweight | NewInsulinScore_Normal | NewGlucose_Low | NewGlucose_Normal | NewGlucose_Overweight | NewGlucose_Secret |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 | False | True | False | False | False | False | False | False | False | True |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 | False | True | False | False | False | True | False | True | False | False |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 | 1 | False | True | False | False | False | False | False | False | False | True |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 | False | True | False | False | False | True | False | True | False | False |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 | False | False | True | False | False | False | False | False | False | True |
# Specify the columns to be converted to 0 or 1
columns_to_convert = ['NewBMI_Obesity 1', 'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight', 'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low', 'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']
for column in columns_to_convert:
    df[column] = df[column].map({True: 1, False: 0})
df.head()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome | NewBMI_Obesity 1 | NewBMI_Obesity 2 | NewBMI_Obesity 3 | NewBMI_Overweight | NewBMI_Underweight | NewInsulinScore_Normal | NewGlucose_Low | NewGlucose_Normal | NewGlucose_Overweight | NewGlucose_Secret |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
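The one-hot encoding and the True/False-to-1/0 mapping can be collapsed into a single step, since `pd.get_dummies` accepts a `dtype` argument. A sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({"NewGlucose": ["Low", "Normal", "Secret"]})

# dtype=int yields 0/1 columns directly, so no separate True/False -> 1/0 mapping is needed
dummies = pd.get_dummies(toy, columns=["NewGlucose"], drop_first=True, dtype=int)
print(list(dummies.columns))  # → ['NewGlucose_Normal', 'NewGlucose_Secret']
```

In the notebook this would be `pd.get_dummies(df, columns=[...], drop_first=True, dtype=int)`, making the `columns_to_convert` loop unnecessary.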
df.columns
Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'NewBMI_Obesity 1',
'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight',
'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low',
'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'],
dtype='object')
categorical_df = df[['NewBMI_Obesity 1',
'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight',
'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low',
'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret']]
categorical_df.head()
| | NewBMI_Obesity 1 | NewBMI_Obesity 2 | NewBMI_Obesity 3 | NewBMI_Overweight | NewBMI_Underweight | NewInsulinScore_Normal | NewGlucose_Low | NewGlucose_Normal | NewGlucose_Overweight | NewGlucose_Secret |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
y=df['Outcome']
X=df.drop(['Outcome','NewBMI_Obesity 1',
'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight',
'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low',
'NewGlucose_Normal', 'NewGlucose_Overweight', 'NewGlucose_Secret'], axis=1)
cols = X.columns
index = X.index
X.head()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.0 | 148.0 | 72.0 | 35.0 | 169.5 | 33.6 | 0.627 | 50 |
| 1 | 1.0 | 85.0 | 66.0 | 29.0 | 102.5 | 26.6 | 0.351 | 31 |
| 2 | 8.0 | 183.0 | 64.0 | 32.0 | 169.5 | 23.3 | 0.672 | 32 |
| 3 | 1.0 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 |
| 4 | 5.0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 |
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(X)
X=transformer.transform(X)
X=pd.DataFrame(X, columns = cols, index = index)
X.head()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.75 | 0.775 | 0.000 | 1.000000 | 1.000000 | 0.177778 | 0.669707 | 1.235294 |
| 1 | -0.50 | -0.800 | -0.375 | 0.142857 | 0.000000 | -0.600000 | -0.049511 | 0.117647 |
| 2 | 1.25 | 1.650 | -0.500 | 0.571429 | 1.000000 | -0.966667 | 0.786971 | 0.176471 |
| 3 | -0.50 | -0.700 | -0.375 | -0.714286 | -0.126866 | -0.433333 | -0.528990 | -0.470588 |
| 4 | 0.50 | 0.500 | -2.000 | 1.000000 | 0.977612 | 1.233333 | 4.998046 | 0.235294 |
X = pd.concat([X, categorical_df], axis=1)
X.head()
| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | NewBMI_Obesity 1 | NewBMI_Obesity 2 | NewBMI_Obesity 3 | NewBMI_Overweight | NewBMI_Underweight | NewInsulinScore_Normal | NewGlucose_Low | NewGlucose_Normal | NewGlucose_Overweight | NewGlucose_Secret | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.75 | 0.775 | 0.000 | 1.000000 | 1.000000 | 0.177778 | 0.669707 | 1.235294 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | -0.50 | -0.800 | -0.375 | 0.142857 | 0.000000 | -0.600000 | -0.049511 | 0.117647 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1.25 | 1.650 | -0.500 | 0.571429 | 1.000000 | -0.966667 | 0.786971 | 0.176471 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | -0.50 | -0.700 | -0.375 | -0.714286 | -0.126866 | -0.433333 | -0.528990 | -0.470588 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 4 | 0.50 | 0.500 | -2.000 | 1.000000 | 0.977612 | 1.233333 | 4.998046 | 0.235294 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Note: X was already robust-scaled above, so StandardScaler here rescales
# every column a second time, including the 0/1 dummy features.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
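The scaler above is fit on the training split only (which avoids leakage), but it also standardizes the one-hot dummy columns, and any cross-validation run later refits models on already-scaled data. A leakage-safe alternative is to scale only the numeric columns inside a Pipeline. A minimal sketch on synthetic stand-in data (the column names and shapes here are illustrative, not the notebook's actual frame):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 numeric columns plus 2 binary dummies.
Xs, ys = make_classification(n_samples=200, n_features=8, random_state=0)
Xdf = pd.DataFrame(Xs, columns=[f"num{i}" for i in range(8)])
Xdf["dummy_a"] = (Xdf["num0"] > 0).astype(int)
Xdf["dummy_b"] = 1 - Xdf["dummy_a"]

numeric_cols = [f"num{i}" for i in range(8)]
pre = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",  # leave the 0/1 dummies untouched
)
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

# The scaler is refit inside each CV fold, so no test fold leaks into it.
scores = cross_val_score(pipe, Xdf, ys, cv=5)
print("mean CV accuracy:", round(scores.mean(), 3))
```

The same `pipe` object can be passed directly to `GridSearchCV`, keeping preprocessing and tuning in one leakage-free unit.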
# Machine Learning models
# Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
LogisticRegression()
y_pred = log_reg.predict(X_test)
accuracy_score(y_train, log_reg.predict(X_train))
0.8470394736842105
log_reg_acc = accuracy_score(y_test, log_reg.predict(X_test))
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
def predict_and_plot(model, inputs, targets, name=''):
    preds = model.predict(inputs)
    accuracy = accuracy_score(targets, preds)
    print("Accuracy: {:.2f}%".format(accuracy * 100))
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} Confusion Matrix'.format(name))
    return preds
# Predict and plot on the training data
train_preds = predict_and_plot(log_reg, X_train, y_train, 'Train')
# Predict and plot on the validation data
val_preds = predict_and_plot(log_reg, X_test, y_test, 'Validation')
Accuracy: 84.70%
Accuracy: 89.47%
confusion_matrix(y_test, y_pred)
array([[88, 10],
[ 6, 48]], dtype=int64)
print(classification_report(y_test, y_pred))
precision recall f1-score support
0 0.94 0.90 0.92 98
1 0.83 0.89 0.86 54
accuracy 0.89 152
macro avg 0.88 0.89 0.89 152
weighted avg 0.90 0.89 0.90 152
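The 89.47% test accuracy above comes from a single 80/20 split and can shift noticeably with a different `random_state`. A cross-validated estimate is steadier; here is a minimal sketch using synthetic stand-in data in place of the notebook's scaled training matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the scaled feature matrix and labels used above.
Xs, ys = make_classification(n_samples=300, n_features=8, random_state=1)

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), Xs, ys,
                         cv=cv, scoring="accuracy")
print("mean CV accuracy: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))
```

Reporting mean and standard deviation across folds makes comparisons between models less sensitive to one lucky split.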
# KNN
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(accuracy_score(y_train, knn.predict(X_train)))
knn_acc = accuracy_score(y_test, knn.predict(X_test))
print(accuracy_score(y_test, knn.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.875
0.881578947368421
[[88 10]
[ 8 46]]
precision recall f1-score support
0 0.92 0.90 0.91 98
1 0.82 0.85 0.84 54
accuracy 0.88 152
macro avg 0.87 0.87 0.87 152
weighted avg 0.88 0.88 0.88 152
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)
y_train_pred = knn_model.predict(X_train)
y_val_pred = knn_model.predict(X_val)
train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)
confusion = confusion_matrix(y_val, y_val_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Validation)')
plt.show()
Training Accuracy: 0.8717105263157895
Validation Accuracy: 0.8421052631578947
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
param_grid = {
'n_neighbors': [1, 3, 5, 7, 9]
}
knn_model = KNeighborsClassifier()
grid_search = GridSearchCV(knn_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
y_train_pred = best_model.predict(X_train)
y_val_pred = best_model.predict(X_val)
train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Training Accuracy with Best Hyperparameters:", train_accuracy)
print("Validation Accuracy with Best Hyperparameters:", val_accuracy)
Training Accuracy with Best Hyperparameters: 0.8634868421052632
Validation Accuracy with Best Hyperparameters: 0.8486842105263158
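Besides reading `best_estimator_` off the grid search, it can help to see how accuracy varies across the candidate `k` values, since KNN with `k=1` tends to overfit. A small sketch on synthetic stand-in data, with the same `k` grid as above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the k values mirror the param_grid above.
Xs, ys = make_classification(n_samples=300, n_features=8, random_state=2)
ks = [1, 3, 5, 7, 9]
cv_means = [cross_val_score(KNeighborsClassifier(n_neighbors=k), Xs, ys, cv=5).mean()
            for k in ks]
best_k = ks[int(np.argmax(cv_means))]
print("best k:", best_k)
```

Plotting `cv_means` against `ks` gives the usual elbow curve and makes the bias/variance trade-off of `k` visible at a glance.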
# SVM
svc = SVC(probability=True)
parameter = {
    'gamma': [0.0001, 0.001, 0.01, 0.1],
    'C': [0.01, 0.05, 0.5, 1, 10, 15, 20]  # dropped a duplicated 0.01 entry
}
grid_search = GridSearchCV(svc, parameter)
grid_search.fit(X_train, y_train)
GridSearchCV(estimator=SVC(probability=True),
             param_grid={'C': [0.01, 0.05, 0.5, 1, 10, 15, 20],
                         'gamma': [0.0001, 0.001, 0.01, 0.1]})
# best_parameter
grid_search.best_params_
{'C': 10, 'gamma': 0.1}
grid_search.best_score_
0.8618615363771847
# Note: gamma=0.01 is used here, although the grid search above selected gamma=0.1.
svc = SVC(C=10, gamma=0.01, probability=True)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print(accuracy_score(y_train, svc.predict(X_train)))
svc_acc = accuracy_score(y_test, svc.predict(X_test))
print(accuracy_score(y_test, svc.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.8618421052631579
0.9013157894736842
[[88 10]
[ 5 49]]
precision recall f1-score support
0 0.95 0.90 0.92 98
1 0.83 0.91 0.87 54
accuracy 0.90 152
macro avg 0.89 0.90 0.89 152
weighted avg 0.91 0.90 0.90 152
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
y_train_pred = svm_model.predict(X_train)
y_val_pred = svm_model.predict(X_val)
train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)
train_confusion = confusion_matrix(y_train, y_train_pred)
val_confusion = confusion_matrix(y_val, y_val_pred)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.heatmap(train_confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Training)')
plt.subplot(1, 2, 2)
sns.heatmap(val_confusion, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Validation)')
plt.show()
Training Accuracy: 0.837171052631579
Validation Accuracy: 0.8552631578947368
# Decision Tree
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
y_pred = DT.predict(X_test)
print(accuracy_score(y_train, DT.predict(X_train)))
print(accuracy_score(y_test, DT.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
1.0
0.7763157894736842
[[83 15]
[19 35]]
precision recall f1-score support
0 0.81 0.85 0.83 98
1 0.70 0.65 0.67 54
accuracy 0.78 152
macro avg 0.76 0.75 0.75 152
weighted avg 0.77 0.78 0.77 152
# hyperparameter tuning of dt
grid_param = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 7, 10],
    'splitter': ['best', 'random'],
    'min_samples_leaf': [1, 2, 3, 5, 7],
    'min_samples_split': [1, 2, 3, 5, 7],  # note: scikit-learn requires min_samples_split >= 2, so the 1 candidates fail
    'max_features': ['auto', 'sqrt', 'log2']  # note: 'auto' is a deprecated alias of 'sqrt' in recent scikit-learn
}
grid_search_dt = GridSearchCV(DT, grid_param, cv=50, n_jobs=-1, verbose = 1)
grid_search_dt.fit(X_train, y_train)
Fitting 50 folds for each of 1200 candidates, totalling 60000 fits
GridSearchCV(cv=50, estimator=DecisionTreeClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'max_depth': [3, 5, 7, 10],
                         'max_features': ['auto', 'sqrt', 'log2'],
                         'min_samples_leaf': [1, 2, 3, 5, 7],
                         'min_samples_split': [1, 2, 3, 5, 7],
                         'splitter': ['best', 'random']},
             verbose=1)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {
'max_depth': [5],
'min_samples_split': [3],
'min_samples_leaf': [7],
'criterion': ['gini', 'entropy']
}
decision_tree_model = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(decision_tree_model, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
best_model.fit(X_train, y_train)
train_accuracy = best_model.score(X_train, y_train)
val_accuracy = best_model.score(X_test, y_test)
print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)
Training Accuracy: 0.9210526315789473
Validation Accuracy: 0.7763157894736842
grid_search_dt.best_params_
{'criterion': 'entropy',
'max_depth': 7,
'max_features': 'log2',
'min_samples_leaf': 3,
'min_samples_split': 3,
'splitter': 'best'}
grid_search_dt.best_score_
0.8821794871794871
DT = grid_search_dt.best_estimator_
y_pred = DT.predict(X_test)
print(accuracy_score(y_train, DT.predict(X_train)))
dt_acc = accuracy_score(y_test, DT.predict(X_test))
print(accuracy_score(y_test, DT.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.8799342105263158
0.8618421052631579
[[92 6]
[15 39]]
precision recall f1-score support
0 0.86 0.94 0.90 98
1 0.87 0.72 0.79 54
accuracy 0.86 152
macro avg 0.86 0.83 0.84 152
weighted avg 0.86 0.86 0.86 152
rand_clf = RandomForestClassifier(criterion = 'entropy', max_depth = 15, max_features = 0.75, min_samples_leaf = 2, min_samples_split = 3, n_estimators = 130)
rand_clf.fit(X_train, y_train)
RandomForestClassifier(criterion='entropy', max_depth=15, max_features=0.75,
                       min_samples_leaf=2, min_samples_split=3,
                       n_estimators=130)
y_pred = rand_clf.predict(X_test)
print(accuracy_score(y_train, rand_clf.predict(X_train)))
rand_acc = accuracy_score(y_test, rand_clf.predict(X_test))
print(accuracy_score(y_test, rand_clf.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.9967105263157895
0.881578947368421
[[92 6]
[12 42]]
precision recall f1-score support
0 0.88 0.94 0.91 98
1 0.88 0.78 0.82 54
accuracy 0.88 152
macro avg 0.88 0.86 0.87 152
weighted avg 0.88 0.88 0.88 152
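The random forest's near-perfect training accuracy (99.67%) against 88.16% on test suggests overfitting; inspecting feature importances shows which measurements drive the predictions. A sketch on synthetic stand-in data, with hypothetical feature names mirroring the notebook's numeric columns:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names standing in for the notebook's numeric features.
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

Xs, ys = make_classification(n_samples=300, n_features=8, random_state=3)
rf = RandomForestClassifier(n_estimators=130, random_state=3).fit(Xs, ys)

# Impurity-based importances sum to 1; print them largest first.
for i in np.argsort(rf.feature_importances_)[::-1]:
    print(f"{feature_names[i]}: {rf.feature_importances_[i]:.3f}")
```

On the real data, a `sns.barplot` over these values makes a compact companion figure for the report.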
gbc = GradientBoostingClassifier()
parameters = {
    'loss': ['deviance', 'exponential'],  # note: 'deviance' is renamed 'log_loss' in scikit-learn >= 1.1
    'learning_rate': [0.001, 0.1, 1, 10],
    'n_estimators': [100, 150, 180, 200]
}
grid_search_gbc = GridSearchCV(gbc, parameters, cv = 10, n_jobs = -1, verbose = 1)
grid_search_gbc.fit(X_train, y_train)
Fitting 10 folds for each of 32 candidates, totalling 320 fits
GridSearchCV(cv=10, estimator=GradientBoostingClassifier(), n_jobs=-1,
             param_grid={'learning_rate': [0.001, 0.1, 1, 10],
                         'loss': ['deviance', 'exponential'],
                         'n_estimators': [100, 150, 180, 200]},
             verbose=1)
grid_search_gbc.best_params_
{'learning_rate': 0.1, 'loss': 'exponential', 'n_estimators': 100}
grid_search_gbc.best_score_
0.8981147540983606
# n_estimators=150 here differs from the grid-search best of 100;
# this model is replaced by best_estimator_ in the next cell anyway.
gbc = GradientBoostingClassifier(learning_rate=0.1, loss='exponential', n_estimators=150)
gbc.fit(X_train, y_train)
GradientBoostingClassifier(loss='exponential', n_estimators=150)
gbc = grid_search_gbc.best_estimator_
y_pred = gbc.predict(X_test)
print(accuracy_score(y_train, gbc.predict(X_train)))
gbc_acc = accuracy_score(y_test, gbc.predict(X_test))
print(accuracy_score(y_test, gbc.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.9802631578947368
0.8421052631578947
[[94 4]
[20 34]]
precision recall f1-score support
0 0.82 0.96 0.89 98
1 0.89 0.63 0.74 54
accuracy 0.84 152
macro avg 0.86 0.79 0.81 152
weighted avg 0.85 0.84 0.83 152
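Rather than grid-searching `n_estimators` directly, gradient boosting lets you fit once with a large tree count and trace held-out accuracy after every stage via `staged_predict`. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's train/test split.
Xs, ys = make_classification(n_samples=400, n_features=8, random_state=4)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.25, random_state=4)

gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                random_state=4).fit(Xtr, ytr)

# staged_predict yields test-set predictions after each boosting stage,
# so a single fit is enough to trace accuracy as trees are added.
test_acc = [accuracy_score(yte, pred) for pred in gb.staged_predict(Xte)]
best_stage = int(np.argmax(test_acc)) + 1
print("best number of trees:", best_stage)
```

This is far cheaper than refitting the model once per candidate `n_estimators` inside `GridSearchCV`.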
from xgboost import XGBClassifier
xgb = XGBClassifier(objective = 'binary:logistic', learning_rate = 0.01, max_depth = 10, n_estimators = 180)
xgb.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.01, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=10, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=180, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
y_pred = xgb.predict(X_test)
print(accuracy_score(y_train, xgb.predict(X_train)))
xgb_acc = accuracy_score(y_test, xgb.predict(X_test))
print(accuracy_score(y_test, xgb.predict(X_test)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.9819078947368421
0.8552631578947368
[[95 3]
[19 35]]
precision recall f1-score support
0 0.83 0.97 0.90 98
1 0.92 0.65 0.76 54
accuracy 0.86 152
macro avg 0.88 0.81 0.83 152
weighted avg 0.86 0.86 0.85 152
# Model Comparison
models = pd.DataFrame({
'Model': ['Logistic Regression', 'KNN', 'SVM', 'Decision Tree Classifier', 'Random Forest Classifier', 'Gradient Boosting Classifier', 'XgBoost'],
'Score': [100*round(log_reg_acc,4), 100*round(knn_acc,4), 100*round(svc_acc,4), 100*round(dt_acc,4), 100*round(rand_acc,4),
100*round(gbc_acc,4), 100*round(xgb_acc,4)]
})
models.sort_values(by = 'Score', ascending = False)
| Model | Score | |
|---|---|---|
| 2 | SVM | 90.13 |
| 0 | Logistic Regression | 89.47 |
| 1 | KNN | 88.16 |
| 4 | Random Forest Classifier | 88.16 |
| 3 | Decision Tree Classifier | 86.18 |
| 6 | XgBoost | 85.53 |
| 5 | Gradient Boosting Classifier | 84.21 |
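The ranking above is based on one 80/20 split, so the ordering of closely scored models (e.g. KNN vs. Random Forest, both 88.16) is not very reliable. A cross-validated comparison is a steadier basis for the table; a minimal sketch on synthetic stand-in data with a subset of the models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's feature matrix and labels.
Xs, ys = make_classification(n_samples=300, n_features=8, random_state=5)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=5),
}

# Mean 5-fold CV accuracy per model, printed best first.
results = {name: cross_val_score(m, Xs, ys, cv=5).mean()
           for name, m in candidates.items()}
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Reporting the fold standard deviation alongside the mean would also show whether the gaps between models are larger than the split-to-split noise.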
import pickle
model = gbc  # save the fitted model itself, not its accuracy score
pickle.dump(model, open("diabetes.pkl", 'wb'))
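After pickling, it is worth verifying that the saved model round-trips and predicts identically to the in-memory one. A self-contained sketch with a synthetic stand-in model (using `pickle.dumps`/`loads` in memory; `pickle.dump` to a file behaves the same):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the fitted gbc model.
Xs, ys = make_classification(n_samples=100, n_features=8, random_state=6)
model = GradientBoostingClassifier(random_state=6).fit(Xs, ys)

blob = pickle.dumps(model)      # in-memory round trip
restored = pickle.loads(blob)

# The restored model must reproduce the original predictions exactly.
assert (restored.predict(Xs) == model.predict(Xs)).all()
print("round-trip OK")
```

Note that pickles are only reliably loadable with the same scikit-learn version that wrote them, so pinning the version alongside `diabetes.pkl` is a sensible precaution.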
from sklearn import metrics
plt.figure(figsize=(8,5))
models = [
{
'label': 'LR',
'model': log_reg,
},
{
'label': 'DT',
'model': DT,
},
{
'label': 'SVM',
'model': svc,
},
{
'label': 'KNN',
'model': knn,
},
{
'label': 'XGBoost',
'model': xgb,
},
{
'label': 'RF',
'model': rand_clf,
},
{
'label': 'GBDT',
'model': gbc,
}
]
for m in models:
    model = m['model']
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    fpr1, tpr1, thresholds = metrics.roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    # note: roc_auc_score is fed hard 0/1 predictions here; passing the
    # predict_proba scores instead would give the conventional ROC AUC
    auc = metrics.roc_auc_score(y_test, model.predict(X_test))
    plt.plot(fpr1, tpr1, label='%s - ROC (area = %0.2f)' % (m['label'], auc))
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('1 - Specificity (False Positive Rate)', fontsize=12)
plt.ylabel('Sensitivity (True Positive Rate)', fontsize=12)
plt.title('ROC - Diabetes Prediction', fontsize=12)
plt.legend(loc="lower right", fontsize=12)
plt.savefig("roc_diabetes.jpeg", format='jpeg', dpi=400, bbox_inches='tight')
plt.show()
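Since the conclusion prioritizes recall on the positive (diabetic) class, a precision-recall curve is a natural companion to the ROC plot, especially with the mild class imbalance here. A minimal sketch on synthetic, mildly imbalanced stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in roughly matching the ~65/35 class split in this data.
Xs, ys = make_classification(n_samples=400, n_features=8,
                             weights=[0.65, 0.35], random_state=7)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.25, random_state=7)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
probs = clf.predict_proba(Xte)[:, 1]

# Precision/recall pairs across all probability thresholds, plus the
# area-style summary (average precision).
precision, recall, thresholds = precision_recall_curve(yte, probs)
ap = average_precision_score(yte, probs)
print("average precision: {:.3f}".format(ap))
```

Plotting `recall` against `precision` (e.g. with `plt.plot(recall, precision)`) shows directly what precision must be given up to push recall on the diabetic class higher.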
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
models = [
{
'label': 'LR',
'model': log_reg,
},
{
'label': 'DT',
'model': DT,
},
{
'label': 'SVM',
'model': svc,
},
{
'label': 'KNN',
'model': knn,
},
{
'label': 'XGBoost',
'model': xgb,
},
{
'label': 'RF',
'model': rand_clf,
},
{
'label': 'GBDT',
'model': gbc,
}
]
means_roc = []
means_accuracy = [100*round(log_reg_acc,4), 100*round(dt_acc,4), 100*round(svc_acc,4), 100*round(knn_acc,4), 100*round(xgb_acc,4),
100*round(rand_acc,4), 100*round(gbc_acc,4)]
for m in models:
    model = m['model']
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    fpr1, tpr1, thresholds = metrics.roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    # AUC computed from hard predictions, as in the ROC plot above
    auc = metrics.roc_auc_score(y_test, model.predict(X_test))
    auc = 100 * round(auc, 4)
    means_roc.append(auc)
print(means_accuracy)
print(means_roc)
# data to plot
n_groups = 7
means_accuracy = tuple(means_accuracy)
means_roc = tuple(means_roc)
# create plot
fig, ax = plt.subplots(figsize=(8,5))
index = np.arange(n_groups)
bar_width = 0.35
opacity = 0.8
rects1 = plt.bar(index, means_accuracy, bar_width,
alpha=opacity,
color='mediumpurple',
label='Accuracy (%)')
rects2 = plt.bar(index + bar_width, means_roc, bar_width,
alpha=opacity,
color='rebeccapurple',
label='ROC (%)')
plt.xlim([-1, 8])
plt.ylim([60, 95])
plt.title('Performance Evaluation - Diabetes Prediction', fontsize=12)
plt.xticks(index, (' LR', ' DT', ' SVM', ' KNN', 'XGBoost' , ' RF', ' GBDT'), rotation=40, ha='center', fontsize=12)
plt.legend(loc="upper right", fontsize=10)
plt.savefig("PE_diabetes.jpeg", format='jpeg', dpi=400, bbox_inches='tight')
plt.show()
[89.47, 86.18, 90.13, 88.16000000000001, 85.53, 88.16000000000001, 84.21]
[87.11, 78.03999999999999, 90.27, 89.44, 80.88, 83.47, 79.44]
import pickle
filename = 'diabetes.sav'
# after the loop above, `model` holds the last entry in `models` (the fitted GBDT)
pickle.dump(model, open(filename, 'wb'))
I have completed the "Diabetes Prediction" task. The original dataset contains 768 records. Building seven different models and tuning their hyperparameters made it possible to compare their performance and decide which one is most effective for predicting diabetes. As mentioned above, because of the unique characteristics of the medical domain, correctly identifying patients with a disease such as diabetes is crucial, so the model with the highest recall on the diabetic class should be the top priority. In feature engineering, I created the new columns 'NewBMI_Obesity 1', 'NewBMI_Obesity 2', 'NewBMI_Obesity 3', 'NewBMI_Overweight', 'NewBMI_Underweight', 'NewInsulinScore_Normal', 'NewGlucose_Low', 'NewGlucose_Normal', 'NewGlucose_Overweight', and 'NewGlucose_Secret' to enhance the predictive power of the dataset. Finally, as shown in the model comparison table, the SVM provided the highest test accuracy (90.13%), followed by Logistic Regression (89.47%).